library(dplyr)
library(ggplot2)
library(magrittr)
The CPS_2018 data set contains wage data collected by the Current Population Survey (CPS) in 2018. Using these data, we can explore the relationship between annual wages and marital status (married or single) among 18-34 year olds. The original codebook is here, and the data reflect how the government collects information about employment.
# Load the data
CPS_2018 <- read.csv("https://www.macalester.edu/~ajohns24/data/CPS_2018.csv")
# Let's just look at 18 - 34 year olds and people who make under $250,000
CPS_2018 <- CPS_2018 %>%
filter(age >= 18, age <= 34) %>%
filter(wage < 250000)
We’ll start with a small model of wage vs marital status:
# Visualize the relationship
ggplot(CPS_2018, aes(y = wage, x = marital)) +
geom_boxplot()
# Model the relationship
CPS_model_1 <- lm(wage ~ marital, CPS_2018)
summary(CPS_model_1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46145.23 921.062 50.10002 0.000000e+00
## maritalsingle -17052.37 1127.177 -15.12839 5.636068e-50
Let’s start simple, by controlling for age (an imperfect measure of experience) in our model of wages by marital status. NOTE: Pause to think about the coefficients before answering the questions below.
# Construct the model
CPS_model_2 <- lm(wage ~ marital + age, CPS_2018)
summary(CPS_model_2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19595.948 3691.6998 -5.308110 1.184066e-07
## maritalsingle -7500.146 1191.8526 -6.292847 3.545964e-10
## age 2213.869 120.7701 18.331265 2.035782e-71
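To see what the maritalsingle coefficient means in a parallel-lines model, here is a small sketch on synthetic data (hypothetical toy values, not the CPS data): the coefficient equals the predicted gap between the two groups at any fixed age.

```r
# Toy data (hypothetical, for illustration only -- not the CPS data)
set.seed(1)
toy <- data.frame(
  age     = rep(18:34, times = 2),
  marital = rep(c("married", "single"), each = 17)
)
toy$wage <- 20000 + 2000 * toy$age -
  8000 * (toy$marital == "single") + rnorm(34, sd = 1000)

toy_model <- lm(wage ~ marital + age, toy)

# Predicted wage gap between a single and a married 25-year-old...
gap_25 <- diff(predict(toy_model,
                       data.frame(marital = c("married", "single"), age = 25)))
# ...equals the maritalsingle coefficient (and would at any age we pick):
all.equal(unname(gap_25), unname(coef(toy_model)["maritalsingle"]))
```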
Visualize this model, thinking about which geom matches our model assumptions.
# HINT CODE: don't change
ggplot(___, aes(y = ___, x = ___, color = ___)) +
geom____(size = 0.1, alpha = 0.5) +
geom_line(aes(y = CPS_model_2$fitted.values), size = 1.25)
# Your solution
ggplot(CPS_2018, aes(y = wage, x = age, color = marital)) +
geom_jitter(size = 0.1, alpha = 0.5) +
geom_line(aes(y = CPS_model_2$fitted.values), size = 1.25)
After controlling for age, the wage gap is about \(\$7500\).
How can we interpret the maritalsingle coefficient? On average…
We’ve seen that among all workers, the wage gap of single vs married workers is $17,052. We also saw that age accounts for some of this difference. Let’s see what happens when we control for even more workforce covariates: model wages by marital status while controlling for age and years of education:
CPS_model_3 <- lm(wage ~ marital + age + education, CPS_2018)
summary(CPS_model_3)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64898.607 4099.8737 -15.829416 2.254709e-54
## maritalsingle -6478.094 1119.9345 -5.784351 7.988760e-09
## age 1676.796 116.3086 14.416777 1.102113e-45
## education 4285.259 207.2158 20.680173 3.209448e-89
ggplot(CPS_2018, aes(x = age, y = wage, color = marital, size = education)) +
geom_point(alpha = 0.5) +
geom_line(aes(y = CPS_model_3$fitted.values), size = 1.25)
2 planes, one for each category.
In CPS_model_3, the coefficients contain the information we want! How can we interpret the education coefficient? On average…
How can we interpret the maritalsingle coefficient? On average…
Let’s control for another workforce covariate: model wages by marital status while controlling for age (quantitative), years of education (quantitative), and the industry in which one works (categorical):
CPS_model_4 <- lm(wage ~ marital + age + education + industry, CPS_2018)
summary(CPS_model_4)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -52498.857 7143.8481 -7.3488206 2.533275e-13
## maritalsingle -5892.842 1105.6898 -5.3295615 1.053631e-07
## age 1493.360 116.1673 12.8552586 6.651441e-37
## education 3911.117 243.0192 16.0938565 4.500408e-56
## industryconstruction 5659.082 6218.5649 0.9100302 3.628760e-01
## industryinstallation_production 1865.650 6109.2613 0.3053806 7.600964e-01
## industrymanagement 1476.884 6031.2901 0.2448704 8.065727e-01
## industryservice -7930.403 5945.6509 -1.3338158 1.823603e-01
## industrytransportation -1084.176 6197.2462 -0.1749448 8.611342e-01
Visualize this model using facet_wrap(~ ___), a la Activity 5.
ggplot(CPS_2018, aes(x = age, y = wage, color = marital, size = education)) +
geom_point(alpha = 0.5) +
geom_line(aes(y = CPS_model_4$fitted.values), size = 1.25) +
facet_wrap(~ industry)
12 planes: the two quantitative predictors (age and education) define each plane, and the binary marital predictor crossed with the six-value industry predictor gives 2 × 6 = 12 of them.
Construction, because it has the highest coefficient. Service is the lowest.
How can we interpret the maritalsingle coefficient? On average, being single decreases wage by nearly $6,000 after controlling for the other factors.
Consider two people with the same age, education, industry, typical number of work hours, and health status. However, one is single and one is married.
Construct CPS_model_5, which also controls for typical work hours and health status.
CPS_model_5 <- lm(wage ~ marital + age + education + industry + hours + health, CPS_2018)
summary(CPS_model_5)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64886.5747 6914.18198 -9.38456275 1.171028e-20
## maritalsingle -4992.7685 1061.84882 -4.70195794 2.687274e-06
## age 1061.1410 115.83503 9.16079518 9.031462e-20
## education 3443.7625 236.12723 14.58435151 1.128646e-46
## industryconstruction 5381.3857 5959.05620 0.90306007 3.665630e-01
## industryinstallation_production 2951.0372 5854.23981 0.50408547 6.142365e-01
## industrymanagement 5107.6364 5782.95334 0.88322283 3.771832e-01
## industryservice -3074.5127 5705.56537 -0.53886207 5.900201e-01
## industrytransportation -207.3439 5940.02074 -0.03490626 9.721567e-01
## hours 732.1340 43.72488 16.74410733 2.340115e-60
## healthfair -7407.7981 2901.71339 -2.55290483 1.072955e-02
## healthgood -2470.8096 1259.44276 -1.96182766 4.987035e-02
## healthpoor -9086.9110 7657.43781 -1.18667774 2.354441e-01
## healthvery_good 292.5278 1020.89213 0.28654136 7.744823e-01
Nearly $5,000 is still a meaningful change from baseline, and few remaining covariates are as strong. Only sex, disability, and citizenship are left, and they may be strongly linked to marriage, which would account for its effect. Location is not a variable in the data and may also have an effect.
Take two workers, one of which is married and the other of which is single. The models above provided the following insights into the typical difference in wages for these two groups:
| Model | Assume the two people have the same… | Wage difference | R2 |
|---|---|---|---|
| CPS_model_1 | NA | -$17,052 | 0.067 |
| CPS_model_2 | age | -$7,500 | 0.157 |
| CPS_model_3 | age, education | -$6,478 | 0.257 |
| CPS_model_4 | age, education, industry | -$5,893 | 0.280 |
| CPS_model_5 | age, education, industry, hours, health | -$4,993 | 0.341 |
The marital coefficient got closer and closer to 0 as we included more covariates in our model. Explain the significance of this phenomenon in the context of our analysis: what does it mean?
It means that marital status is correlated with these covariates, i.e., adding the covariates as predictors captures much of the wage variation previously attributed to marital status, diminishing its predictive value to the model.
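A synthetic sketch of the same phenomenon (hypothetical toy values, not the CPS data): when group membership is correlated with a covariate that actually drives the outcome, controlling for that covariate shrinks the group coefficient toward 0.

```r
# Toy confounding example (hypothetical values)
set.seed(42)
n     <- 500
group <- rbinom(n, 1, 0.5)                  # e.g. 1 = single
covar <- 25 + 5 * group + rnorm(n)          # the covariate differs by group
y     <- 1000 * covar + rnorm(n, sd = 500)  # outcome driven only by the covariate

coef(lm(y ~ group))["group"]          # large: soaks up the covariate's effect
coef(lm(y ~ group + covar))["group"]  # near 0 once the covariate is controlled
```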
Simplicity means that less data needs to be collected for higher confidence. Simplicity also makes the model easier to interpret.
I’m curious about the wage breakdown in different industries at different ages, regardless of marriage or education.
CPS_model_6 <- lm(wage ~ age + industry, CPS_2018)
summary(CPS_model_6)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29472.6767 6815.3475 -4.32445693 1.576335e-05
## age 2091.2841 108.7642 19.22769203 5.215931e-78
## industryconstruction 9060.5366 6491.1956 1.39581939 1.628669e-01
## industryinstallation_production 7606.3564 6369.8554 1.19411761 2.325215e-01
## industrymanagement 19140.2256 6197.1591 3.08854835 2.028833e-03
## industryservice 494.1208 6184.6252 0.07989502 9.363258e-01
## industrytransportation 4466.3275 6462.1215 0.69115499 4.895189e-01
Use dplyr tools to complete the data drill below. (There are some structural hints in the online manual version!)
# Load and use the complete CPS_2018 data
CPS_2018 <- read.csv("https://www.macalester.edu/~ajohns24/data/CPS_2018.csv")
# What are the mean and median wage?
CPS_2018 %>%
summarize(mean(wage), median(wage))
## mean(wage) median(wage)
## 1 51330.61 38000
# What are the mean and median wage in each industry?
CPS_2018 %>%
group_by(industry) %>%
summarize(mean(wage), median(wage))
## # A tibble: 6 x 3
## industry `mean(wage)` `median(wage)`
## * <chr> <dbl> <dbl>
## 1 ag 32115. 30000
## 2 construction 44754. 36000
## 3 installation_production 46011. 38000
## 4 management 70917. 55000
## 5 service 34508. 25000
## 6 transportation 38147. 30000
# How many workers fall into each health group?
CPS_2018 %>%
group_by(health) %>%
count
## # A tibble: 5 x 2
## # Groups: health [5]
## health n
## <chr> <int>
## 1 excellent 3162
## 2 fair 537
## 3 good 2610
## 4 poor 80
## 5 very_good 3611
# Obtain a dataset of workers aged 18 to 22 that are in good health
# Show just the first 6 rows
# Hint: <= is "less than or equal to"
CPS_2018 %>%
filter(17 < age, age < 23) %>%
head(6)
## wage age education sex marital industry health hours
## 1 33000 19 16 female single management very_good 40
## 2 4000 20 12 male single installation_production very_good 54
## 3 8400 19 12 male single service excellent 20
## 4 19250 22 13 female single service excellent 40
## 5 18200 21 16 female single service excellent 25
## 6 14000 19 11 female single service excellent 40
## citizenship disability education_level
## 1 yes no bachelors
## 2 yes no HS
## 3 yes no HS
## 4 yes no some_college
## 5 yes no bachelors
## 6 yes no some_HS
# In one set of lines, calculate the median age (in years) amongst workers in excellent health
# See online manual for structural hint
CPS_2018 %>%
filter(health == "excellent") %>%
summarize(median(age))
## median(age)
## 1 38
# In one set of lines, calculate the median age (in MONTHS) amongst workers in excellent health
# See online manual for structural hint
CPS_2018 %>%
filter(health == "excellent") %>%
summarize(median(age * 12))
## median(age * 12)
## 1 456
# Load data
library(palmerpenguins)
data(penguins)
(Art by @allison_horst)
The flipper_length_mm variable currently measures flipper length in mm. Create a new variable named flipper_length_cm which measures flipper length in cm. Store this inside the penguins data. Hints:
penguins %<>% mutate(flipper_length_cm = flipper_length_mm / 10)
penguin_model_1 <- lm(bill_length_mm ~ flipper_length_mm, penguins)
penguin_model_2 <- lm(bill_length_mm ~ flipper_length_cm, penguins)
penguin_model_3 <- lm(bill_length_mm ~ flipper_length_mm + flipper_length_cm, penguins)
penguin_model_4 <- lm(bill_length_mm ~ body_mass_g, penguins)
penguin_model_5 <- lm(bill_length_mm ~ flipper_length_mm + body_mass_g, penguins)
What can a penguin’s flipper (arm) length tell us about their bill length? To answer this question, we’ll consider 3 of our models:
| model | predictors |
|---|---|
| penguin_model_1 | flipper_length_mm |
| penguin_model_2 | flipper_length_cm |
| penguin_model_3 | flipper_length_mm + flipper_length_cm |
Construct two plots: bill_length_mm vs flipper_length_mm, and bill_length_mm vs flipper_length_cm.
point_and_line <- function(x){ x +
geom_point() +
geom_smooth(method = "lm", se = FALSE)}
penguins %>%
ggplot(aes(x = bill_length_mm, y = flipper_length_mm)) %>%
point_and_line
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
summary(penguin_model_1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.2648678 3.20015684 -2.27016 2.382330e-02
## flipper_length_mm 0.2547682 0.01588914 16.03410 1.743974e-43
penguins %>%
ggplot(aes(x = bill_length_mm, y = flipper_length_cm)) %>%
point_and_line
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
summary(penguin_model_2)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.264868 3.2001568 -2.27016 2.382330e-02
## flipper_length_cm 2.547682 0.1588914 16.03410 1.743974e-43
Before examining any output, ask your gut: will the penguin_model_2 R-squared be less than, equal to, or more than that of penguin_model_1? How will the penguin_model_3 R-squared compare to that of penguin_model_1?
All three R-squared values are the same, because they all reflect the same correlation. Multiplying the data by a constant doesn't change the fit of the line.
All the same!
summary(penguin_model_1)$r.squared
## [1] 0.430574
summary(penguin_model_2)$r.squared
## [1] 0.430574
summary(penguin_model_3)$r.squared
## [1] 0.430574
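The invariance holds for any linear rescaling of a predictor. Here is a quick sketch with the built-in mtcars data (a stand-in example; the kg conversion is just an illustrative rescale): the coefficient rescales, but R-squared does not change.

```r
# Rescale a predictor: coefficient rescales, R-squared stays put
mt <- mtcars
mt$wt_kg <- mt$wt * 453.6   # wt is in 1000s of lbs; 1000 lb is about 453.6 kg
m_lb <- lm(mpg ~ wt, mt)
m_kg <- lm(mpg ~ wt_kg, mt)

isTRUE(all.equal(summary(m_lb)$r.squared, summary(m_kg)$r.squared))   # TRUE
isTRUE(all.equal(unname(coef(m_lb)["wt"]),
                 unname(coef(m_kg)["wt_kg"]) * 453.6))                # TRUE
```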
Plot flipper_length_mm vs flipper_length_cm. The variables contribute the same information, so they have the same fit, and adding them both as predictors doesn't actually add information for the model to use.
penguins %>%
ggplot(aes(x = flipper_length_mm, y = flipper_length_cm)) %>%
point_and_line
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
In summary(penguin_model_3), the flipper_length_cm coefficient is NA. Explain why this makes sense. HINT: Thinking about yesterday's activity, why wouldn't it make sense to interpret this coefficient? BONUS: For those of you that have taken MATH 236, this has to do with matrices that are not of full rank!
Because the cm data is a scaled duplicate of the mm data, it contributes nothing to the model, so a coefficient would be not applicable in this context.
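Here is a minimal, self-contained sketch of that rank deficiency (toy numbers, not the penguin data): lm() detects that one column is an exact multiple of another and reports NA for the aliased coefficient.

```r
# Toy data: x_cm is an exact rescaling of x_mm
d <- data.frame(x_mm = c(180, 190, 200, 210, 220),
                y    = c(38, 40, 43, 44, 47))
d$x_cm <- d$x_mm / 10

redundant_model <- lm(y ~ x_mm + x_cm, d)
coef(redundant_model)["x_cm"]   # NA: x_cm adds no new information
```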
In this exercise you'll consider 3 models of bill_length_mm:

| model | predictors |
|---|---|
| penguin_model_1 | flipper_length_mm |
| penguin_model_4 | body_mass_g |
| penguin_model_5 | flipper_length_mm + body_mass_g |
Which is the better single predictor of bill_length_mm: flipper_length_mm or body_mass_g? Provide some numerical evidence.
summary(penguin_model_1)$r.squared
## [1] 0.430574
summary(penguin_model_4)$r.squared
## [1] 0.3541557
Flipper length seems to be a better predictor of bill length than body mass, because the R-squared of model 1 is higher than that of model 4.
penguin_model_5 incorporates both flipper_length_mm and body_mass_g as predictors. Before examining a model summary, ask your gut: will the penguin_model_5 R-squared be close to 0.35, close to 0.43, or greater than 0.6?
summary(penguin_model_5)$r.squared
## [1] 0.4328544
This is only slightly better than model 1 (0.431), even though we added a second predictor.
Check the penguin_model_5 R-squared and summarize how it compares to that of penguin_model_1 and penguin_model_4.
penguins %>%
ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
Plot flipper_length_mm vs body_mass_g. Body mass and flipper length are highly correlated, so body mass doesn't add much more information to the model.
The exercises above have illustrated special phenomena in multivariate modeling:
two predictors are redundant if they contain the same exact information
two predictors are multicollinear if they are strongly associated (they contain very similar information) but are not completely redundant.
Model 3 (flipper length in mm and cm) illustrates redundancy; its R-squared remains the same as model 1's.
Model 5 (flipper length and body mass) illustrates multicollinearity; its R-squared increases only slightly.
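The distinction also shows up in the built-in mtcars data: disp and wt are strongly correlated (multicollinear) but not redundant, so adding the second one is allowed but buys very little.

```r
# disp and wt carry very similar information
cor(mtcars$disp, mtcars$wt)    # about 0.89

r2_wt   <- summary(lm(mpg ~ wt,        mtcars))$r.squared
r2_both <- summary(lm(mpg ~ wt + disp, mtcars))$r.squared
c(r2_wt, r2_both)   # adding disp barely improves R-squared
```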
Not only does the strategy of adding more and more and more predictors complicate our models and have diminishing returns, it can give us silly results. Construct 2 models using a sample of 10 penguins (for illustration purposes only). NOTE: If you’re using an older version of RStudio, you might get a different sample from others but the ideas are the same!
# Take a sample of 10 penguins
# We'll discuss this code later in the course!
set.seed(155)
penguins_small <- sample_n(penguins, size = 10) %>%
mutate(flipper_length_mm = jitter(flipper_length_mm))
# 2 models
poly_mod_1 <- lm(bill_length_mm ~ flipper_length_mm, penguins_small)
poly_mod_2 <- lm(bill_length_mm ~ poly(flipper_length_mm, 2), penguins_small)
# 2 R-squared
summary(poly_mod_1)$r.squared
## [1] 0.7341412
summary(poly_mod_2)$r.squared
## [1] 0.7630516
# Plot the 2 models
ggplot(penguins_small, aes(y = bill_length_mm, x = flipper_length_mm)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
ggplot(penguins_small, aes(y = bill_length_mm, x = flipper_length_mm)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)
Model bill_length_mm by poly(flipper_length_mm, 9), which has 9 polynomial predictors: flipper_length_mm, flipper_length_mm^2, …, flipper_length_mm^9.
poly_mod_9 <- lm(bill_length_mm ~ poly(flipper_length_mm, 9), penguins_small)
summary(poly_mod_9)$r.squared
## [1] 1
ggplot(penguins_small, aes(y = bill_length_mm, x = flipper_length_mm)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 9), se = FALSE)
Does this model reflect the real relationship between bill_length_mm and flipper_length_mm? Probably not: there are weird dips and peaks. How would it predict for penguins outside the penguins_small sample? Poorly; ninth-degree polynomials tend to get big fast away from their inflection points. It is possible to maximize R-squared at the expense of minimizing usefulness.
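In fact, with 10 data points a degree-9 polynomial can interpolate them exactly, so R-squared hits 1 even when the response is pure noise. A quick sketch with toy values:

```r
# 10 points, 9 polynomial terms + intercept = 10 parameters: a "perfect" fit
set.seed(155)
x <- 1:10
y <- rnorm(10)   # pure noise, unrelated to x

summary(lm(y ~ poly(x, 9)))$r.squared   # 1 (up to rounding)
```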
The panel for “Connecting Lines” (“I clicked ‘smooth lines’ in Excel”).
There are so many models we could build! For each possible research goal, indicate which predictors you’d include in your model.
i: We want to understand the relationship between bill length and depth when controlling for penguin species. I’d include bill depth and species. ii: We want to be able to predict a penguin’s bill length from its flipper length alone (because maybe penguins let us get closer to their arms than their nose?). Flipper length only…
Aren’t you so curious about penguins? Identify one new question of interest, and explore this question using the data. It can be a simple question and answered in 1 line / 1 set of lines / 1 plot of R code, so long as it’s not explored elsewhere in this activity.
It’s easier to weigh a penguin than measure one.
ggplot(penguins_small, aes(y = flipper_length_mm, x = body_mass_g)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)